---
title: "An Order-of-Magnitude Larger and Faster Training with ZeRO-2"
excerpt: ""
tags: training ZeRO English
date: 2020-05-19 01:00:00
toc: false
---

ZeRO-2 expands the scope of memory optimizations in the original ZeRO by
tackling the full spectrum of memory consumption during training. More
specifically, ZeRO-2 introduces new technology to reduce the memory footprint
of gradients, activation memory, and fragmented memory, in addition to
optimizer state memory optimization in the original ZeRO. Altogether, the
memory savings empower DeepSpeed to improve the scale and speed of deep
learning training by an order of magnitude. More concretely, ZeRO-2 allows
training models as large as 170 billion parameters up to 10x faster compared
to state of the art.

For more information on ZeRO-2, see our [blog post](https://www.microsoft.com/en-us/research/blog/ZeRO-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/).

For more information on how to use ZeRO-2, see an example of training GPT family of models in this [tutorial](/tutorials/megatron/).

For a technical overview, see our [technical report](https://arxiv.org/abs/1910.02054).
